An empirical study of the effect of outliers on the misclassification error rate

Authors

Edgar Acuña

Caroline Rodríguez

Abstract

An outlier is an observation that deviates so much from other observations that it seems to have been generated by a different mechanism. Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers can indicate individuals or groups whose behavior is very different from that of most individuals in the data set. Outliers are frequently removed to improve the accuracy of estimators, but sometimes the presence of an outlier has a certain meaning, and that explanation can be lost if the outlier is deleted. In this paper we study the effect of the presence of outliers on the performance of three well-known classifiers, based on the results observed on four real-world datasets. We use detection of outliers based on robust statistical estimators of the center and the covariance matrix for the Mahalanobis distance, detection of outliers based on clustering using the partitioning around medoids (PAM) algorithm, and two data mining techniques to detect outliers: Bay’s algorithm for distance-based outliers, and LOF, a density-based local outlier algorithm.
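As an illustration of the density-based idea behind LOF mentioned in the abstract, the following is a minimal brute-force sketch in plain Python. It is not the authors' code: the function name `lof_scores`, the neighborhood size `k`, and the toy data are assumptions made for this example, and the sketch assumes the input contains no duplicate points. Scores near 1 suggest a point is about as dense as its neighbors, while scores well above 1 suggest a local outlier.

```python
import math


def lof_scores(points, k):
    """Return a Local Outlier Factor score per point (brute force, O(n^2)).

    Hypothetical helper for illustration; assumes no duplicate points.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []      # distance from each point to its k-th nearest neighbor
    neighbors = []   # indices within that k-distance (ties included)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist[i][j]).
    lrd = []
    for i in range(n):
        total = sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        if total == 0:
            raise ValueError("duplicate points are not handled in this sketch")
        lrd.append(len(neighbors[i]) / total)

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # Four points on a unit square plus one distant point (toy data).
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    for point, score in zip(data, lof_scores(data, k=2)):
        print(point, round(score, 3))
```

In this toy run the distant point receives by far the largest score. Whether such a point should then be dropped before training a classifier is exactly the question the paper studies empirically; the cutoff on the score is a choice left to the analyst.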

Related resources

Detection of Outliers and Influential Observations in Linear Ridge Measurement Error Models with Stochastic Linear Restrictions

The aim of this paper is to propose some diagnostic methods in linear ridge measurement error models with stochastic linear restrictions using the corrected likelihood. Based on the bias-corrected estimation of model parameters, diagnostic measures are developed to identify outlying and influential observations. In addition, we derive the corrected score test statistic for outlier detection ba...

Analysis of angina pectoris status based on misclassification probabilities of the smoking risk factor in the Tehran Lipid and Glucose Study, 1378-79

Misclassification of disease status and risk factors is one of the main sources of error in studies. Wrong assignment of individuals into exposed and non-exposed groups may seriously distort the results in case-control studies. This study investigates the effect of misclassification error on odds ratio estimates and attempts to introduce a correction method. Data on 3332 men aged 30-69 years fr...

A New Formulation for Cost-Sensitive Two Group Support Vector Machine with Multiple Error Rate

Support vector machine (SVM) is a popular classification technique which classifies data using a max-margin separating hyperplane. The normal vector and bias of this hyperplane are determined by solving a quadratic model, which means that SVM training amounts to an optimization problem. Among the extensions of SVM, the cost-sensitive scheme refers to a model with multiple costs which conside...

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using a vector autoregressive moving average (VARMA) model. But the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased estimation of parameters and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

Separating Well Log Data to Train Support Vector Machines for Lithology Prediction in a Heterogeneous Carbonate Reservoir

The prediction of lithology is necessary in all areas of petroleum engineering. This means that to design a project in any branch of petroleum engineering, the lithology must be well known. Support vector machines (SVMs) use an analytical approach to classification based on statistical learning theory, the principles of structural risk minimization, and empirical risk minimization. In this res...

Publication date: 2005
